Here’s how to scrape one page of the IEA policy database, step by step.

The site https://www.iea.org/policies looks like this:

It’s got a table of policies, but the table doesn’t contain all the information we want about those policies.

To get that information, we will have to follow the links to those policies, and extract it from that page.

How do we get the page as data?

To get and parse (understand) the page, we can use read_html() from the rvest package.

library(rvest)
html <- read_html("https://www.iea.org/policies")
html
## {html_document}
## <html dir="ltr" lang="en-GB" class="no-js page-all-policies ">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body>\n    <!-- Google Tag Manager (noscript) -->\n    <noscript>\n      ...

The variable html now holds the parsed HTML of the page.
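
Once a page is parsed, we can query it with CSS selectors. As a rough sketch of how a vector of policy links like the link_destinations used later might be collected, the snippet below pulls the href of every <a> element and keeps the ones that look like policy pages. The hand-written fragment, the name example_destinations, and the assumption that policy links start with /policies/ followed by a number are illustrative guesses based on the URLs printed further down, not a description of the site's actual markup.

```r
library(rvest)

# A small hand-written fragment standing in for the real page (an assumption,
# so that the example runs without a network connection)
example_html <- minimal_html('
  <a href="/policies/11663-fuel-economy-standards-on-light-duty-vehicles">Fuel economy</a>
  <a href="/about">About</a>
  <a href="/policies/12197-heavy-goods-vehicle-charge">HGV charge</a>
')

# Get the href attribute of every <a> element
hrefs <- example_html %>% html_elements("a") %>% html_attr("href")

# Keep only the links that look like individual policy pages
example_destinations <- hrefs[grepl("^/policies/[0-9]+", hrefs)]
example_destinations
```

On the real page a more specific selector is probably needed, so it is worth checking the result by eye before looping over it.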

Now to tackle the sidebar

Some of the information we want is in a sidebar, and the html looks like this:

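Here, link_html is the parsed HTML of an individual policy page, obtained by calling read_html() on that policy's link. We can see that all of the values we want are contained inside <span> elements that have the class o-policy-aside-item__value. We can extract all of these by passing that CSS selector to html_elements() and passing the result to html_text().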

values <- link_html %>% html_elements("span.o-policy-aside-item__value") %>% html_text()
values
## [1] "Japan"    "2030"     "Ended"    "National"

If we trust that these will always appear in the same order, we can simply say the first item of this vector is the country, the second is the year, etc. We can represent this as a list. Attributes of a list can be assigned with either l$attribute <- value or l["attribute"] <- value (Note the quotes).

l <- list()
l$country <- values[1]
l["year"] <- values[2]
l$status <- values[3]
l["jurisdiction"] <- values[4]

However, if we are not sure about this, we can extract the keys as well as the values and build the list from both. To do this, we iterate through the vectors of keys and values.

keys <- link_html %>% html_elements("span.o-policy-aside-item__title") %>% html_text()
values <- link_html %>% html_elements("span.o-policy-aside-item__value") %>% html_text()
l <- list()
for (i in seq_along(keys)) { # Do a loop where i goes from 1 to the length of our keys (seq_along also copes with an empty vector)
  print(i) # print i
  key <- keys[i] # Get the ith key in our vector of keys
  value <- values[i] # Get the ith value in our vector of values
  l[key] <- value # assign the value to the attribute that is named key
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
l
## $Country
## [1] "Japan"
## 
## $Year
## [1] "2030"
## 
## $Status
## [1] "Ended"
## 
## $Jurisdiction
## [1] "National"
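
The loop makes each step explicit. As a sketch of a more compact alternative, the same list can be built in one go with base R's setNames(), assuming there is exactly one value per key. The example vectors below are copied from the output above, and the names example_keys, example_values and l2 are only for illustration.

```r
# Example vectors, copied from the output above
example_keys <- c("Country", "Year", "Status", "Jurisdiction")
example_values <- c("Japan", "2030", "Ended", "National")

# Stop early if the two vectors do not line up one-to-one
stopifnot(length(example_keys) == length(example_values))

# as.list() turns each value into a list element; setNames() attaches the keys
l2 <- setNames(as.list(example_values), example_keys)
l2$Country
```

The length check matters on real pages: if a sidebar item had a title but no value, the keys and values would silently shift out of step.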

We can turn this list into a one-row dataframe with as.data.frame().

link_df <- as.data.frame(l)
link_df
##   Country Year Status Jurisdiction
## 1   Japan 2030  Ended     National
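
One thing to be aware of: by default as.data.frame() makes column names syntactically valid, which is why a key such as "Policy types" appears as Policy.types in the final table further down. The sketch below shows the default next to check.names = FALSE, which keeps the names as they are. The list l3 is a made-up example.

```r
# A made-up list with one name that contains a space
l3 <- list(Country = "Japan")
l3["Policy types"] <- "Regulation"

names(as.data.frame(l3))                       # default: the space becomes a dot
names(as.data.frame(l3, check.names = FALSE))  # names kept exactly as given
```

Either choice works, as long as we remember which form of the name to use when we refer to the column later.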

Extracting Topics, Policy Types, and the other things in the boxes

Below the policy text, we can see some tags in boxes.

The html for these looks like this

Each set of tags is contained in a <div> element (which is a type of box) that has the class o-policy-content__list. Let’s select these with a CSS selector.

tag_boxes <- link_html %>% html_elements("div.o-policy-content__list")
tag_boxes
## {xml_nodeset (4)}
## [1] <div class="o-policy-content__list">\n                                    ...
## [2] <div class="o-policy-content__list">\n                                    ...
## [3] <div class="o-policy-content__list">\n                                    ...
## [4] <div class="o-policy-content__list">\n                                    ...

We’ll want to process these one by one, so we can do a loop. For now, let’s just take the second one

tag_box <- tag_boxes[2]
tag_box
## {xml_nodeset (1)}
## [1] <div class="o-policy-content__list">\n                                    ...

If we look carefully at the html, we can see that within each <div> there is a <span> with the class o-policy-content-list__title. The text of that contains the type of tag. If we run html_element() on an element we have already identified (instead of the whole html), then we search only among that element’s descendants (its children, their children, and so on).

title_span <- tag_box %>% html_element("span.o-policy-content-list__title") # search within our tag_box for span elements with the class o-policy....
title <- html_text(title_span) # Get the text of that element
title
## [1] "Policy types"
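
Note the difference between html_element() and html_elements(): the first returns exactly one result per input element (a missing value if nothing matches), while the second returns every match. A small sketch on a hand-written fragment shows this. The markup and class names here are invented for the example and are not the IEA's.

```r
library(rvest)

# Hand-written fragment (an assumption): two boxes, the second has no title
fragment <- minimal_html('
  <div class="box"><span class="title">Topics</span><span class="tag">A</span><span class="tag">B</span></div>
  <div class="box"><span class="tag">C</span></div>
')
boxes <- fragment %>% html_elements("div.box")

# html_element(): one result per box, NA where there is no match
titles <- boxes %>% html_element("span.title") %>% html_text()

# html_elements(): every match across all boxes, in document order
all_tags <- boxes %>% html_elements("span.tag") %>% html_text()

titles
all_tags
```

This is why we use html_element() for the single title of a box and html_elements() for its tags.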

Now we want to extract the tags themselves. These are contained in <span> elements that have the class a-tag__label.

tag_spans <- tag_box %>% html_elements("span.a-tag__label") # search within our tag_box for span elements with the class a-tag__label
tags <- html_text(tag_spans) # Get the text of those elements
tags
## [1] "Regulation"                                  
## [2] "Energy efficiency / Fuel economy obligations"
## [3] "Performance-based policies"

If we want to put these into a dataframe, we will need a single text value. We can use paste() to join the tags into one string.

tag_string <- paste(tags, collapse="; ") # collapse tells paste() to join a vector of strings into a single string, separated here by a semicolon and a space
tag_string
## [1] "Regulation; Energy efficiency / Fuel economy obligations; Performance-based policies"
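
Joining the tags does not lose information, provided no tag itself contains the separator. As a quick sketch, strsplit() recovers the original vector from the string printed above. The names tag_string_example and tags_again are only for illustration.

```r
# The joined string from the output above
tag_string_example <- "Regulation; Energy efficiency / Fuel economy obligations; Performance-based policies"

# strsplit() returns a list with one element per input string, so we take the first;
# fixed = TRUE treats "; " as literal text rather than a regular expression
tags_again <- strsplit(tag_string_example, "; ", fixed = TRUE)[[1]]
tags_again
```

If a tag could contain "; ", a different separator would be a safer choice.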

Processing all sets of tags

We want to do this for each box of tags we found, so we put this in a for loop.

for (tag_box in tag_boxes) {
  title_span <- tag_box %>% html_element("span.o-policy-content-list__title") # search within our tag_box for span elements with the class o-policy....
  title <- html_text(title_span) # Get the text of that element
  
  tag_spans <- tag_box %>% html_elements("span.a-tag__label") # search within our tag_box for span elements with the class a-tag__label
  tags <- html_text(tag_spans) # Get the text of those elements
  tag_string <- paste(tags, collapse="; ") # collapse tells paste() to join a vector of strings into a single string, separated here by a semicolon and a space
  
  l[title] <- tag_string # We can add to the list we made earlier
}
l
## $Country
## [1] "Japan"
## 
## $Year
## [1] "2030"
## 
## $Status
## [1] "Ended"
## 
## $Jurisdiction
## [1] "National"
## 
## $Topics
## [1] "Energy Efficiency"
## 
## $`Policy types`
## [1] "Regulation; Energy efficiency / Fuel economy obligations; Performance-based policies"
## 
## $Sectors
## [1] "Road transport"
## 
## $Technologies
## [1] "Transport technologies"

Putting it all together

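Now we are ready to put it all together inside a loop. In this loop we will process each link in turn. The vector link_destinations holds the relative links to the policy pages, so each one is appended to the site address before it is read. We’ll just take the first 10 links, so as not to bother the IEA too much.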

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
df <- NULL # Let's initialise a null object, which we will bind a new row to after each link
for (link in link_destinations[1:10]){
  # First we'll initialise an empty list to store our values
  l <- list()
  
  # Then we create our link and parse it
  absolute_link <- paste0("https://iea.org", link)
  print("processing link")
  print(absolute_link)
  link_html <- read_html(absolute_link)
  
  # Now we get our text from the paragraph as before
  p <- link_html %>% html_element("div.m-block__content p")
  l$text <- html_text2(p) # and add it to our list of attributes
  
  # Now we process the sidebar
  keys <- link_html %>% html_elements("span.o-policy-aside-item__title") %>% html_text()
  values <- link_html %>% html_elements("span.o-policy-aside-item__value") %>% html_text()
  for (i in seq_along(keys)) { # Do a loop where i goes from 1 to the length of our keys (seq_along also copes with an empty vector)
    key <- keys[i] # Get the ith key in our vector of keys
    value <- values[i] # Get the ith value in our vector of values
    l[key] <- value # assign the value to the attribute that is named key
  }
  
  # Now we process the tags
  tag_boxes <- link_html %>% html_elements("div.o-policy-content__list")
  for (tag_box in tag_boxes) {
    title_span <- tag_box %>% html_element("span.o-policy-content-list__title") # search within our tag_box for the title span
    title <- html_text(title_span) # Get the text of that element
    
    tag_spans <- tag_box %>% html_elements("span.a-tag__label") # search within our tag_box for span elements with the class a-tag__label
    tags <- html_text(tag_spans) # Get the text of those elements
    tag_string <- paste(tags, collapse="; ") # collapse tells paste() to join a vector of strings into a single string, separated here by a semicolon and a space
    
    l[title] <- tag_string # We can add to the list we made earlier
  }
  df <- bind_rows(df, as.data.frame(l))
}
## [1] "processing link"
## [1] "https://iea.org/policies/11663-fuel-economy-standards-on-light-duty-vehicles"
## [1] "processing link"
## [1] "https://iea.org/policies/12654-emissions-limit-on-the-capacity-market-regulations"
## [1] "processing link"
## [1] "https://iea.org/policies/8506-gas-boilers-replacement-by-low-carbon-heating-systems"
## [1] "processing link"
## [1] "https://iea.org/policies/3124-local-government-fleet-renewal-mandate"
## [1] "processing link"
## [1] "https://iea.org/policies/12046-decommissioning-fossil-fuel-power-plants"
## [1] "processing link"
## [1] "https://iea.org/policies/8401-enhancements-to-minimum-energy-performance-standards-meps"
## [1] "processing link"
## [1] "https://iea.org/policies/12197-heavy-goods-vehicle-charge"
## [1] "processing link"
## [1] "https://iea.org/policies/11497-proposals-for-location-of-wind-power-turbines"
## [1] "processing link"
## [1] "https://iea.org/policies/13139-resolution-407152019-wholesale-energy-market-with-res-in-2023"
## [1] "processing link"
## [1] "https://iea.org/policies/11456-updated-meps-central-air-conditioners-and-heat-pumps"
df
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     text
## 1  Japan sets and periodically updates fuel economy standards on cars, vans and trucks under its Top Runner Program. The efficiency requirements are based on the most fuel-efficient vehicles on the market, and manufacturers and importers of these vehicles are required to meet these vehicle efficiency standards on a corporate average basis. The fuel efficiency of passenger vehicles has improved by 96% over the past two decades. Japan has announced new fuel economy standards on light duty vehicles, aiming at improving fuel efficiency by 32% by 2030, compared with 2016 levels. The scope has been expanded to cover the efficiency of electric vehicles and plug-in hybrids, and new standards take into account the energy consumption of the fuel production (gasoline and electricity), the so-called 'well-to-wheel' approach.
## 2                                                                                                                                                                                                                                                                                                                                                                The Capacity Market Regulations emissions limit aims to reduce the amount of CO2 emitted per unit of electricity. The Polish Electricity Networks made the amendment in view of adapting to EU regulations on the fulfilment of emissions limit for units participating in the capacity auctions, expected to start in July 2025. The limit is 550g carbon dioxide from fossil fuels per kWh of net electricity produced. Certification for the auction for 2025 was in September 2020.
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Gas boilers in the UK will be replaced by low-carbon heating systems in all new homes built after 2025.
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        All new buses and coaches that shall be acquired for public transport services from 2025 onwards must be low-emission vehicles.
## 5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               Within the Strategy for the environmental policy of the Slovak Republic, the Slovak government decided to reduce the use of coal for electricity generation and has adopted an action plan to achieve this goal.\n\nA pilot project for the Upper Nitra region has been developed with the support of the European Union
## 6                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             Want to know more about this policy ? Learn moreLearn more
## 7                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             As of 2023 the government intends to introduce a levy on truck traffic. This will be applied to dutch and foreign trucks of more than 3500 kg. based on the kilometer distance and roads used. The revenues will be used for innovation woards more sustainable road traffic. Relevant parties will be involved in decisions on re-investing the revenues.
## 8                                                                                                                                                                                                                                                                                                                                                                                                         These proposals consider the National Energy Independence Strategy and the National Energy and Climate plan to install a wind farm in the Baltic Sea. The wind farm would reach 700 MW and produce 2.5-3 TWh of electricity per year, which is 25% of Lithuania's electricity demand. It may take up to 8 years to install and the territory planned in the Baltic Sea covers an area of 137.5 square km, with an average wind speed of 9 m/s.
## 9                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              To reach a more eco-friendly energy matrix, the Ministry of Mines and Energy approved resolution 40715 that stipulates that, as of 2023, at least 10% of electricity purchases of wholesalers of the Wholesale Energy Market destined to serve end users must come from renewables (FNCER), through long-term contracts. 
## 10                                                                                                                                                                                                                                                                                                                                                                                                                                                 This policy applies to residential central air conditioners and heat pumps installed as part of a home's central heating and cooling system. Residential central air conditioners and heat pumps include split system central air conditioners and heat pumps; single package central air conditioners; single package heat pumps; small-duct high-velocity products; and space constrained products.
##            Country Year  Status Jurisdiction            Topics
## 1            Japan 2030   Ended     National Energy Efficiency
## 2           Poland 2025 Planned     National              <NA>
## 3   United Kingdom 2025 Planned     National Energy Efficiency
## 4           France 2025 Planned     National Energy Efficiency
## 5  Slovak Republic 2023 Planned     National              <NA>
## 6        Singapore 2023 Planned     National Energy Efficiency
## 7      Netherlands 2023 Planned     National Energy Efficiency
## 8        Lithuania 2023 Planned     National  Renewable Energy
## 9         Colombia 2023 Planned     National  Renewable Energy
## 10   United States 2023 Planned     National Energy Efficiency
##                                                                                                                                     Policy.types
## 1                                                           Regulation; Energy efficiency / Fuel economy obligations; Performance-based policies
## 2  Strategic plans; Codes and standards; Nationally Determined Contribution; Targets, plans and framework legislation; Climate change strategies
## 3                                                                                                       Regulation; Other regulatory instruments
## 4                                                                                                                                     Regulation
## 5                                                                                     Strategic plans; Prohibition; Technology bans / phase outs
## 6                          Regulation; Codes and standards; Product-based MEPS; Minimum energy performance standards; Performance-based policies
## 7                                          Payments, finance and taxation; Taxes, fees and charges; Use and activity charges; Road usage charges
## 8                                                                                                                                           <NA>
## 9         Regulation; Mandatory energy management system; Prescriptive requirements and standards; Energy market regulation; Market design rules
## 10                                                                         Regulation; Codes and standards; Minimum energy performance standards
##                                                                                                                    Sectors
## 1                                                                                                           Road transport
## 2                                             Power, Heat and Utilities; Power generation; Electricity and heat generation
## 3                                                                                                                Buildings
## 4  Transport; Road transport; Passenger transport (Road); Mass road transit; Buses and minibuses - Local and urban service
## 5                          Combined heat and power; Fuel processing and transformation; Coal secondary products production
## 6                                                                                         Buildings; Residential; Services
## 7                                                                                                           Road transport
## 8                                                                                                                     <NA>
## 9                                                           Power, Heat and Utilities; Power transmission and distribution
## 10                                                                                                  Buildings; Residential
##                                                                                                                                                 Technologies
## 1                                                                                                                                     Transport technologies
## 2                                                                                                                                                       <NA>
## 3                                                                         Space, water and process heating technologies; Domestic and building-scale boilers
## 4                            Road vehicles; Buses and coaches; Drive train or engine; Battery electric; Plug-in hybrid; Transport technologies; Vehicle type
## 5                                                                                                                                                       <NA>
## 6  Lighting technologies; Exterior lighting (incl. street); Light producing technologies; Incandescent; Compact fluorescent lamp; Light emitting diode (LED)
## 7                                                                                                                                                       <NA>
## 8                                                                                                                                                       <NA>
## 9                                                                                                                                                       <NA>
## 10                                            Space cooling; Centralised AC system; Airconditioners (ACs); Heating, cooling and climate control technologies
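
If we go beyond the first 10 links, it may be worth pausing between requests and making sure one broken page does not stop the whole run. The sketch below is one possible way to structure that, not the only one. The names scrape_links, process_page, fake_process and delay are hypothetical: process_page stands for any function that takes a link and returns a one-row dataframe (for instance, the body of the loop above wrapped in a function), and fake_process is a stand-in so that the example runs without touching the site.

```r
library(dplyr)

# Hypothetical helper: run process_page() on each link, waiting `delay` seconds
# before each request and skipping links that raise an error
scrape_links <- function(links, process_page, delay = 1) {
  rows <- lapply(links, function(link) {
    Sys.sleep(delay) # be polite: wait before each request
    tryCatch(
      process_page(link),
      error = function(e) {
        message("skipping ", link, ": ", conditionMessage(e))
        NULL # bind_rows() ignores NULL entries
      }
    )
  })
  bind_rows(rows) # bind all rows once at the end
}

# Stand-in for the real page processing (an assumption, for illustration only)
fake_process <- function(link) {
  if (grepl("broken", link)) stop("could not read page")
  data.frame(link = link, n_chars = nchar(link), stringsAsFactors = FALSE)
}

result <- scrape_links(
  c("/policies/1-a", "/policies/broken", "/policies/2-b"),
  fake_process,
  delay = 0 # no need to wait when nothing is downloaded
)
result
```

Collecting the rows in a list and binding them once also avoids rebuilding the dataframe on every pass through the loop.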